This project investigates relationships between various parameters and road accidents in the UK. Licensed under the Open Government Licence, the data provides detailed road safety information about the circumstances of personal injury road accidents in Great Britain from 2005, the types (including make and model) of vehicles involved, and the consequential casualties. The statistics relate only to personal injury accidents on public roads that are reported to the police, and subsequently recorded, using the STATS19 accident reporting form. Source: https://data.gov.uk/dataset/road-accidents-safety-data . I will first load the data, examine it for missing values, look at the data types, visualize the data to highlight important information, and then apply some modeling concepts to see if we can predict future outcomes and determine feature importance.
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import plotly
plotly.tools.set_credentials_file(username='w9641432', api_key='LDqLiys5zhXKvdyZc6J1')
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# For Notebooks
init_notebook_mode(connected=True)
import cufflinks as cf
cf.go_offline()
acci = pd.read_csv(r'D:\Python\Insight\Transport\Accidents0515.csv')  # raw string avoids backslash-escape issues
acci.head()
acci.info()
fig1, ax1 = plt.subplots(figsize=(12,8))
sns.heatmap(acci.isnull(),yticklabels=False,cbar=False,cmap='viridis')
acci = acci.drop(['LSOA_of_Accident_Location'], axis=1)
acci = acci.dropna()
acci = acci.reset_index()
acci = acci.drop(columns=['index'])
acci.shape
pf = pd.read_csv(r'D:\Python\Insight\Transport\Police_Force.csv')
pf.columns = pf.columns.str.replace('code','Police_Force')
acci = pd.merge(acci,pf,how='inner',on='Police_Force')
acci.head()
plt.figure(figsize=(15,8))
acci['label'].value_counts().plot(kind='bar')
This plot shows the total number of accidents per Police_Force, which is a good proxy for region. Most accidents are served by the Metropolitan Police (the Greater London area, which serves about 9 million people). Second is the West Midlands (Birmingham), followed by the Thames Valley (the area stretching from Uxbridge west towards Reading). These are heavily populated parts of England with many motorways and other A-class roads: more cars mean a higher probability of accidents.
The below plot shows each accident as a function of geo-location. As there are so many accidents, it actually shows a very good depiction of the British Isles (excl. NI).
plt.figure(figsize=(20,25))
g = sns.scatterplot(x="Longitude", y="Latitude", hue='label', data=acci)
plt.figure(figsize=(20,25))
tmp = acci[(acci['1st_Road_Class']==3)]
g = sns.scatterplot(x="Longitude", y="Latitude", data=tmp)
tmp = acci[(acci['1st_Road_Class']==2)]
g = sns.scatterplot(x="Longitude", y="Latitude", data=tmp)
tmp = acci[(acci['1st_Road_Class']==1)]
g = sns.scatterplot(x="Longitude", y="Latitude", data=tmp)
Britain as depicted by its major road network, drawn entirely from accident locations. Blue points are A-class roads, whilst green and orange are motorways. These are the busiest roads in the UK by volume of traffic.
acci['timestamp'] = pd.to_datetime(acci['Date'], format="%d/%m/%Y")
acci.info()
plt.figure(figsize=(15,8))
sns.set_style('whitegrid')
sns.countplot(x='1st_Road_Class',hue='Accident_Severity',data=acci,palette='rainbow')
Here, we can see that most accidents on primary roads happened in class 3, which is A-class roads. It is also here where most fatalities occur. B-class roads (class 4) claimed the second most lives. The "other" category (class 6) was second highest overall. Highways/motorways saw the least accidents.
tmp = acci[acci['1st_Road_Class']==1]
plt.figure(figsize=(15,8))
tmp['1st_Road_Number'].value_counts().plot(kind='bar')
Here is a list of the most dangerous motorways in Britain. The M1, M25 (or, as Chris Rea called it, the Road to Hell), M6, and M4 have all had more than 6,000 accidents. They are also the busiest by volume of traffic.
tmp = acci[(acci['1st_Road_Class']>1) & (acci['1st_Road_Class']<=3)]
plt.figure(figsize=(15,8))
tmp['1st_Road_Number'].value_counts().nlargest(40).plot(kind='bar')
Here is a plot of the 40 most dangerous A-class roads in the UK. The A1 had over 12000 accidents, whilst the A38 and A6 were closer to 10000 accidents.
tmp = acci[acci['1st_Road_Class']==4]
plt.figure(figsize=(15,8))
tmp['1st_Road_Number'].value_counts().nlargest(40).plot(kind='bar')
Here is a plot of the 40 most dangerous B-class roads in the UK. The B114 had over 1300 accidents, whilst the B2066 and B4113 were closer to 900 accidents.
tmp = acci[acci['1st_Road_Class']>0]
df = tmp.groupby(['1st_Road_Class']).mean()
ax = sns.barplot(x="1st_Road_Class", y="Number_of_Vehicles", data=acci, ci='sd', palette = "Blues_d")
Motorways (classes 1 + 2) had the largest average number of vehicles per accident, which makes sense given the faster speeds and shorter reaction times. The error bars show the standard deviation around the means.
ax = sns.barplot(x="1st_Road_Class", y="Number_of_Casualties", data=acci, ci='sd', palette = "Blues_d")
Motorways (classes 1 + 2) also had the highest mean number of casualties per accident: faster cars, greater impact. Again, the bars show the standard deviation around the mean values.
mwy = len(acci[acci['1st_Road_Class'] <3]) / len(acci)
Awy = len(acci[acci['1st_Road_Class'] ==3]) / len(acci)
Bwy = len(acci[acci['1st_Road_Class'] ==4]) / len(acci)
Cwy = len(acci[acci['1st_Road_Class'] ==5]) / len(acci)
und = len(acci[acci['1st_Road_Class'] >5]) / len(acci)
labels = '1+2 (Motorway)','3 (A-Class)','4 (B-Class)','5 (C-Class)','6 (Undefined)'
sizes = [mwy,Awy,Bwy,Cwy,und]
explode = (0, 0.1, 0, 0, 0)  # only "explode" the 2nd slice (A-Class roads)
fig1, ax1 = plt.subplots(figsize=(8,8))
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
shadow=True, startangle=90)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
This plot shows the probability of each road class being the scene of an accident, relative to the other road classes. A-class roads (3) have the highest probability (45.4%), whilst uncategorized roads (6) have the second highest (~29%). The interesting takeaway is that you are ~10 times more likely to be in an accident on an A-class road than on a motorway (1 + 2), but you have a higher chance (~15%) of dying if you have an accident on a motorway compared to an A-class road.
len(acci[acci['2nd_Road_Class'] != -1]) / len(acci)
Close to 60% of all accidents in the UK involve junctions, i.e. a second road.
junc = acci[acci['2nd_Road_Class'] != -1]
#g = sns.scatterplot(x="Longitude", y="Latitude", hue='Accident_Severity', data=junc)
junc1 = junc.groupby(['Latitude','Longitude','Accident_Severity','2nd_Road_Class'], as_index=False).size().sort_values(ascending=False).reset_index(name='count')
junc2 = junc1[junc1['count']>5]
junc2.head(10)
Here I have grouped the locations of each accident and counted the total number of accidents at each location. Below I plot the position of each accident; the size of each circle indicates the frequency of accidents, so a larger circle means more accidents.
import math
from ast import literal_eval
def merc(lat, lon):
    # convert (lat, lon) in degrees to web mercator (x, y) in metres
    r_major = 6378137.000
    x = r_major * math.radians(lon)
    scale = x / lon
    y = 180.0 / math.pi * math.log(math.tan(math.pi / 4.0 +
        lat * (math.pi / 180.0) / 2.0)) * scale
    return (x, y)
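As a sanity check on the merc() helper, the projection of a known reference point (45°N, 90°E) can be compared with the standard EPSG:3857 values (x ≈ 10,018,754.17 m, y ≈ 5,621,521.49 m). The function is repeated here so the snippet is self-contained:

```python
import math

def merc(lat, lon):
    # Web Mercator (EPSG:3857) forward projection, as used above
    r_major = 6378137.000
    x = r_major * math.radians(lon)
    scale = x / lon
    y = 180.0 / math.pi * math.log(math.tan(math.pi / 4.0 +
        lat * (math.pi / 180.0) / 2.0)) * scale
    return (x, y)

# Known reference point: 45°N, 90°E
x, y = merc(45.0, 90.0)
print(round(x, 2), round(y, 2))
```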
maxlat = junc2['Latitude'].max()
minlat = junc2['Latitude'].min()
maxlon = junc2['Longitude'].max()
minlon = junc2['Longitude'].min()
maxlat, maxlon = merc(maxlat,maxlon)
minlat, minlon = merc(minlat,minlon)
print(maxlat,maxlon)
print(minlat,minlon)
lats=[]
lons=[]
for i in range(0, len(junc2)):
    x, y = merc(junc2['Latitude'][i], junc2['Longitude'][i])
    lats.append(x)
    lons.append(y)
junc2['merc_lat'] = lats
junc2['merc_lon'] = lons
junc2.head()
from bokeh.plotting import figure, show, output_file, output_notebook
from bokeh.tile_providers import CARTODBPOSITRON
output_file("tile.html")
# range bounds supplied in web mercator coordinates
p = figure(x_range=(minlat, maxlat), y_range=(minlon, maxlon),
           x_axis_type="mercator", y_axis_type="mercator")
p.add_tile(CARTODBPOSITRON)
p.circle(x=junc2['merc_lat'],
y=junc2['merc_lon'],
size=junc2['count'],
line_color="#FF0000",
fill_color="#FF0000",
fill_alpha=0.05)
output_notebook()
show(p)
I have used a Mercator projection to visualize junctions that have had at least 5 accidents. Feel free to use the features on the right side of the plot to pan around and zoom in for a better look (this plot is also available as UK.html). It would appear some areas are more susceptible than others, and if you live in Bingham or are passing through, there is one particular roundabout you want to avoid: 64 accidents! A combination of poor driving and bad city planning. In fact, there are quite a few hotspots around Nottingham that are particularly bad.
Growing up in the countryside of the Thames Valley, I have had plenty of driving experience on its roads. I am curious to know which are the most dangerous roads in this region and plan to investigate further. We can limit our dataset to just those police emergencies that involved the Thames Valley Police.
tvp = acci[acci['label']=='Thames Valley']
tvp.info()
junc = tvp[tvp['2nd_Road_Class'] != -1]
junc1 = junc.groupby(['Latitude','Longitude','Accident_Severity','2nd_Road_Class'], as_index=False).size().sort_values(ascending=False).reset_index(name='count')
junc2 = junc1[junc1['count']>5]
junc2.head(10)
maxlat = junc2['Latitude'].max()
minlat = junc2['Latitude'].min()
maxlon = junc2['Longitude'].max()
minlon = junc2['Longitude'].min()
maxlat, maxlon = merc(maxlat,maxlon)
minlat, minlon = merc(minlat,minlon)
print(maxlat,maxlon)
print(minlat,minlon)
lats=[]
lons=[]
for i in range(0, len(junc2)):
    x, y = merc(junc2['Latitude'][i], junc2['Longitude'][i])
    lats.append(x)
    lons.append(y)
junc2['merc_lat'] = lats
junc2['merc_lon'] = lons
junc2.head()
from bokeh.plotting import figure, show, output_file, output_notebook
from bokeh.tile_providers import CARTODBPOSITRON
output_file("tile.html")
# range bounds supplied in web mercator coordinates
p = figure(x_range=(minlat, maxlat), y_range=(minlon, maxlon),
           x_axis_type="mercator", y_axis_type="mercator")
p.add_tile(CARTODBPOSITRON)
p.circle(x=junc2['merc_lat'],
y=junc2['merc_lon'],
size=junc2['count'],
line_color="#FF0000",
fill_color="#FF0000",
fill_alpha=0.05)
output_notebook()
show(p)
The Thames Valley has had most accidents around Oxford, Reading, and Milton Keynes (also available as an html file: ThamesValley.html). I lived towards the bottom right corner and there are a few junctions that are best avoided. Best tell the friends and family!
labels = '30', '60','40','70','50','20','10','15','0'
thirty = len((acci[acci['Speed_limit']==30])) / len(acci)
sixty = len((acci[acci['Speed_limit']==60])) / len(acci)
forty = len((acci[acci['Speed_limit']==40])) / len(acci)
sevnty = len((acci[acci['Speed_limit']==70])) / len(acci)
fifty = len((acci[acci['Speed_limit']==50])) / len(acci)
twenty = len((acci[acci['Speed_limit']==20])) / len(acci)
ten = len((acci[acci['Speed_limit']==10])) / len(acci)
fifteen = len((acci[acci['Speed_limit']==15])) / len(acci)
zero = len((acci[acci['Speed_limit']==0])) / len(acci)
sizes = [thirty,sixty,forty,sevnty,fifty,twenty,ten,fifteen,zero]
explode = (0.1, 0, 0, 0, 0, 0, 0, 0, 0)  # only "explode" the 1st slice (30 mph)
fig1, ax1 = plt.subplots(figsize=(8,8))
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
shadow=True, startangle=90)
ax1.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
It can be seen that more than 60% of accidents occurred on roads with a 30 mph speed limit, which is the typical limit on most B- and C-class roads and some A-class roads.
pf = pd.read_csv(r'D:\Python\Insight\Transport\Day_of_Week.csv')
pf.columns = pf.columns.str.replace('code','Day_of_Week')
doy = acci.groupby(['Day_of_Week'], as_index=False).size().sort_values(ascending=False).reset_index(name='count')
doy = pd.merge(doy,pf,how='inner',on='Day_of_Week')
doy = doy.drop(columns=['Day_of_Week'])
fig, ax = plt.subplots(figsize=(10,4))
plt.bar(doy['label'], doy['count'], 1/1.5, color="blue")
Here is a bar chart of the total counts of accidents by day of week. Friday is worst and Sunday is best.
times = []
for i in range(0, len(acci)):
    times.append(float(acci['Time'][i][0:2]))
acci['Hour'] = times
tod = acci.groupby(['Hour'], as_index=False).size().plot(kind='bar',figsize=(10,4))
This bar chart shows the total number of accidents by hour, and there is a clear pattern: few accidents in the early hours of the morning, a peak around the morning rush hour, an increase during the day as activity rises, a second maximum at the five o'clock rush, and then a decline.
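The two-character slice used above assumes every time is zero-padded HH:MM; splitting on the colon is slightly more robust. A minimal sketch with a hypothetical helper (not part of the original notebook):

```python
def hour_of(time_str):
    # Works for both zero-padded '08:45' and unpadded '8:45',
    # unlike a fixed two-character slice
    return int(time_str.split(':')[0])

print(hour_of('08:45'), hour_of('8:45'), hour_of('17:30'))
```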
tid = acci.groupby(['timestamp'], as_index=False).size().sort_values(ascending=False).reset_index(name='count')
tid = tid.sort_values(by=['timestamp']).reset_index()
tid = tid.drop(['index'], axis=1)
tid.head(10)
data = [go.Scatter( x=tid['timestamp'], y=tid['count'] )]
py.iplot(data, filename='pandas-time-series')
This plot shows the time series of the total number of accidents each day in the UK. It can be seen that over time there is a negative trend, indicating that the roads have become safer. In order to fit a model to this time series, it is better to produce a monthly mean series to reduce the level of noise in the data.
mtid = tid.resample('M', on='timestamp').mean()
mtid['timestamp'] = mtid.index
mtid = mtid.reset_index(drop=True)
data = [go.Scatter( x=mtid['timestamp'], y=mtid['count'], marker = dict(color = '#FFBAD2' ))]
py.iplot(data, filename='pandas-time-series')
Here is the monthly mean time series of accidents. We can see the same trend, as well as possible seasonal patterns. The decline appears linear up until 2013, where it seems to level off for the next two years. Let's assume that the relationship is linear and try to fit a SARIMA model to the time series.
import warnings
import itertools
import statsmodels.api as sm
plt.style.use('fivethirtyeight')
When looking to fit time series data with a seasonal ARIMA model, our first goal is to find the values of SARIMA(p,d,q)(P,D,Q)s that optimize a metric of interest. There are many guidelines and best practices to achieve this goal, yet the correct parametrization of SARIMA models can be a painstaking manual process that requires domain expertise and time. I will resolve this issue by writing Python code to programmatically select the optimal parameter values for our SARIMA(p,d,q)(P,D,Q)s time series model.
# Define the p, d and q parameters to take any value between 0 and 2
p = d = q = range(0, 2) # range(0, 3) # for higher orders
# Generate all different combinations of p, d and q triplets
pdq = list(itertools.product(p, d, q))
# Generate all different combinations of seasonal p, d and q triplets
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
print('Examples of parameter combinations for Seasonal ARIMA...')
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[1]))
print('SARIMAX: {} x {}'.format(pdq[1], seasonal_pdq[2]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[3]))
print('SARIMAX: {} x {}'.format(pdq[2], seasonal_pdq[4]))
I will use a "grid search" to iteratively explore different combinations of parameters. For each combination, I fit a new seasonal ARIMA model with the SARIMAX() function from the statsmodels module and assess its overall quality. Once the entire landscape of parameters has been explored, the optimal set will be the one that yields the best performance for the criteria of interest. Let's begin by generating the various combinations of parameters that we wish to assess.
warnings.filterwarnings("ignore") # specify to ignore warning messages
for param in pdq:
    for param_seasonal in seasonal_pdq:
        try:
            mod = sm.tsa.statespace.SARIMAX(mtid['count'],
                                            order=param,
                                            seasonal_order=param_seasonal,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False)
            results = mod.fit()
            print('ARIMA{}x{}12 - AIC:{}'.format(param, param_seasonal, results.aic))
        except:
            continue
We can now use the triplets of parameters defined above to automate the process of training and evaluating ARIMA models on different combinations. In Statistics and Machine Learning, this process is known as grid search (or hyperparameter optimization) for model selection.
When evaluating and comparing statistical models fitted with different parameters, each can be ranked against the others based on how well it fits the data or its ability to accurately predict future data points. Here, I use the AIC (Akaike Information Criterion) value, which is conveniently returned with ARIMA models fitted using statsmodels. The AIC measures how well a model fits the data while taking into account the overall complexity of the model. A model that fits the data very well while using lots of features will be assigned a larger AIC score than a model that uses fewer features to achieve the same goodness-of-fit. Therefore, we are interested in models that yield the lowest AIC value.
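The trade-off described above follows directly from the definition AIC = 2k − 2 ln(L): for the same log-likelihood, the model with fewer parameters wins. A minimal illustration (hypothetical log-likelihood values, not from the accident data):

```python
def aic(k, log_likelihood):
    # Akaike Information Criterion: lower is better
    return 2 * k - 2 * log_likelihood

# Two hypothetical models with the same goodness-of-fit (log-likelihood -480):
# the one with fewer parameters gets the lower (better) AIC
simple = aic(k=3, log_likelihood=-480.0)
complex_ = aic(k=8, log_likelihood=-480.0)
print(simple, complex_)
```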
Here, I have chosen a SARIMAX model with order (0, 1, 0) and seasonal order (1, 1, 1, 12).
The code chunk above iterates through combinations of parameters and uses the SARIMAX function from statsmodels to fit the corresponding seasonal ARIMA model. Here, the order argument specifies the (p, d, q) parameters, while the seasonal_order argument specifies the (P, D, Q, S) seasonal component. After fitting each SARIMAX() model, the code prints out its respective AIC score.
mod = sm.tsa.statespace.SARIMAX(mtid['count'],
                                order=(0, 1, 0),
                                seasonal_order=(1, 1, 1, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])
results.plot_diagnostics(figsize=(15, 12))
plt.show()
I have experimented with various model setups which produced AIC scores of less than 990, but will show a specific example here with the setup ARIMA(0, 1, 0)x(1, 1, 1, 12). Here, we are only considering seasonal features. Consulting the coefficients and p-values from the results summary, we see that the most significant (using alpha = 0.05) parameter is the AR term at lag 12 months, although the moving average at 12 months is significant at alpha = 0.1. From the plots, we expect the residuals between the observations and the theoretical model to be uncorrelated and randomly distributed. If the model does not exhibit these features, then we can probably improve the fit by changing some of the parameters.
The top left plot shows the residuals, which appear to be stationary and distributed as white noise, whilst the red KDE line follows closely the N(0,1) line (where N(0,1) is the standard notation for a normal distribution with mean 0 and standard deviation 1). This is a good indication that the residuals are normally distributed. The Q-Q plot on the bottom left shows that the ordered distribution of residuals (blue dots) follows the linear trend of samples taken from a standard N(0,1) distribution; again, a strong indication that the residuals are normally distributed. The correlogram shows that the time-series residuals have low correlation with lagged versions of themselves.
Those observations lead us to conclude that our model produces a satisfactory fit that could help us understand our time series data and forecast future values. Although we have a satisfactory fit, some parameters of our seasonal ARIMA model could be changed to improve our model fit. For example, our grid search only considered a restricted set of parameter combinations, so we may find better models if we widened the grid search.
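The grid search above only prints each AIC; one small extension is to record every (order, seasonal_order, AIC) triple and take the minimum. A sketch with a hypothetical stand-in scoring function in place of a real SARIMAX fit:

```python
import itertools

p = d = q = range(0, 2)  # widen to range(0, 3) for a broader search
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(a, b, c, 12) for a, b, c in pdq]

def fake_aic(order, seasonal_order):
    # Stand-in for sm.tsa.statespace.SARIMAX(...).fit().aic
    return sum(order) * 10 + sum(seasonal_order[:3])

results = [(order, seas, fake_aic(order, seas))
           for order, seas in itertools.product(pdq, seasonal_pdq)]
best_order, best_seasonal, best_aic = min(results, key=lambda r: r[2])
print(best_order, best_seasonal, best_aic)
```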
pred = results.get_prediction()
pred_ci = pred.conf_int()
forecast = results.get_forecast(36)
forecast_ci = forecast.conf_int()
arima = pd.DataFrame(data=pred.predicted_mean)
arima['lower'] = pred_ci.iloc[:, 0]
arima['upper'] = pred_ci.iloc[:, 1]
arima['orig'] = mtid['count']
arima.columns = ['average','lower','upper','orig']
arima.index = pd.date_range(start='2005-01-01', end='2015-12-01', freq='MS', closed=None)
future= pd.DataFrame(data=forecast.predicted_mean)
future['lower'] = forecast_ci.iloc[:, 0]
future['upper'] = forecast_ci.iloc[:, 1]
future.columns = ['average','lower','upper']
future.index = pd.date_range(start='2015-01-01', end='2017-12-01', freq='MS', closed=None)
fig, ax = plt.subplots(figsize=(24,8))
arima['orig'][1:].plot(ax=ax, style='k-', label='Observations')
arima['average'][25:121].plot(ax=ax, style='b-', label='One-step-ahead Prediction')
pred_ci = pred.conf_int(alpha=0.05)
pred_index = arima.index
ax.fill_between(pred_index[25:121], arima['lower'][25:121], arima['upper'][25:121], alpha=0.3)
future['average'].plot(ax=ax, style='r', label='Forecast Linear')
forecast_index = future.index
ax.fill_between(forecast_index, future['lower'], future['upper'], alpha=0.1)
# Cleanup the image
#ax.set_ylim((4, 8));
legend = ax.legend(loc='lower left');
y_model = arima['average'][25:]
y_truth = arima['orig'][25:]
# Compute the mean square error
mse = ((y_model - y_truth) ** 2).mean()
print('The Mean Squared Error of the SARIMA is {}'.format(round(mse, 2)))
y_forecasted = future['average'][:13]
y_truth = arima['orig'][122:]
# Compute the mean square error
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of the Forecast is {}'.format(round(mse, 2)))
Here, I have plotted the observations and the SARIMA model using the chosen parameters for (p,d,q) and (P,D,Q)s. The SARIMA captures a lot of the seasonality and behaviour of the observations, as well as the downward trend. The model fit based on the MSE is 588. Ideally, we would want this number as close to zero as possible, and differing parameters could be explored to reduce it. The forecast is shown in red and predicts three years into the future. The SARIMA is intentionally stopped one year short, with the earlier years acting as training data and the final year left over as test data to compare against the forecast. The MSE of the forecast is 160, which is fairly good. The range of uncertainty of the forecast increases over time.
This model is fairly basic. We have assumed a linear relationship with accidents over time, but this assumption might not be true. Further investigation would be needed to see if more advanced SARIMA models would fit the data better.
From the previous section, we saw that the roads have become safer over time, but can we be confident of this? Here, I perform a significance test using a t-test. To do this, I will compare the mean numbers of accidents, and their distributions, in 2005 and 2015.
Null-Hypothesis Ho = That the road safety in 2005 and 2015 is the same. Alternative Hypothesis Ha = Roads are safer in 2015.
We are looking at a one-tailed test, and I will use alpha = 0.05.
tid['year'] = tid.timestamp.map(lambda x: x.strftime('%Y'))
tid.head()
fig, ax = plt.subplots(figsize=(9,6))
tid.loc[tid['year'] == '2005', 'count'].hist(bins=30, alpha = 1)
tid.loc[tid['year'] == '2015', 'count'].hist(bins=30, alpha = 0.7)
ax.legend(['2005', '2015'])
Here is a histogram of the distributions of daily accident counts in 2005 and 2015. Both distributions show Gaussian tendencies, so we can reasonably apply Gaussian statistics.
tof = tid.loc[tid['year'] == '2005', 'count'].mean()
tofstd = tid.loc[tid['year'] == '2005', 'count'].std()
tff = tid.loc[tid['year'] == '2015', 'count'].mean()
tffstd = tid.loc[tid['year'] == '2015', 'count'].std()
print(str(tof) + ' +/- ' + str(tofstd))
print(str(tff) + ' +/- ' + str(tffstd))
from scipy.stats import norm  # mlab.normpdf was removed from recent matplotlib
hox = np.linspace(tof - 3*tofstd, tof + 3*tofstd, 100)
plt.plot(hox, norm.pdf(hox, tof, tofstd))
hax = np.linspace(tff - 3*tffstd, tff + 3*tffstd, 100)
plt.plot(hax, norm.pdf(hax, tff, tffstd))
plt.show()
# Student t-distribution Percent Point Function
from scipy.stats import t
# define probability
p = 0.95
df = 364 # n-1
# retrieve value <= probability
value = t.ppf(p, df)
print(value)
# confirm with cdf
p = t.cdf(value, df)
print(p)
This is the critical t-value to beat in order to obtain a significant result and hence reject the null hypothesis.
## Import the packages
import numpy as np
from scipy import stats
## Define 2 random distributions
#Sample Size
N = 365
a = tid.loc[tid['year'] == '2005', 'count']
b = tid.loc[tid['year'] == '2015', 'count']
## Calculate the Standard Deviation
#Calculate the variance to get the standard deviation
#For unbiased max likelihood estimate we have to divide the var by N-1, and therefore the parameter ddof = 1
var_a = a.var(ddof=1)
var_b = b.var(ddof=1)
#std deviation
s = np.sqrt((var_a + var_b)/2)
## Calculate the t-statistics
t = (a.mean() - b.mean())/(s*np.sqrt(2/N))
# Compare with the critical t-value
#Degrees of freedom
df = 2*N - 2
#p-value after comparison with the t
p = 1 - stats.t.cdf(t,df=df)
print("t = " + str(t))
print("p = " + str(p))
## Cross Checking with the internal scipy function using Welch's T-Test
t2, p2 = stats.ttest_ind(a,b,equal_var=False)
print("t = " + str(t2))
print("p = " + str(p2))
As we are comparing two means, we can use the independent t-test (Welch's, as the variances differ). If the calculated t-statistic is greater than the critical t-value, the test concludes that there is a statistically significant difference between the two populations, and we reject the null hypothesis. In this case, the t-statistic is 26 >> 1.65, and the p-value << 0.05 confirms that we can reject the null hypothesis and conclude that the roads in 2015 were significantly safer than in 2005.
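The Welch statistic that stats.ttest_ind(equal_var=False) computes can also be written out by hand, with the Welch-Satterthwaite approximation for the degrees of freedom. A minimal sketch on two small hypothetical samples (not the accident data):

```python
import math

def welch_t(a, b):
    # Welch's t-statistic and Welch-Satterthwaite degrees of freedom
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # unbiased sample variance
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2a, se2b = va / na, vb / nb
    t = (ma - mb) / math.sqrt(se2a + se2b)
    df = (se2a + se2b) ** 2 / (se2a ** 2 / (na - 1) + se2b ** 2 / (nb - 1))
    return t, df

t_stat, df = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t_stat, df)
```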
a = tid.loc[tid['year'] == '2005', 'count']
b = tid.loc[tid['year'] == '2006', 'count']
t3, p3 = stats.ttest_ind(a,b,equal_var=False)
print("t = " + str(t3))
print("p = " + str(p3))
In fact, by 2006 the roads were already significantly safer than in 2005 according to the above t-test: t-stat = 3.7 > t-crit = 1.65, giving a p-value = 0.0002 << alpha = 0.05.
Examining this dataset, I noticed a column stating whether accidents occurred in rural or urban areas. This sets up a binary classification problem that we can attack with the other input parameters.
acci.loc[acci.Urban_or_Rural_Area != 2, 'Urban_or_Rural_Area'] = 0
acci.loc[acci.Urban_or_Rural_Area == 2, 'Urban_or_Rural_Area'] = 1
acci.groupby(['Urban_or_Rural_Area'], as_index=False).size().plot(kind='bar',figsize=(10,4))
tod = acci.groupby(['Urban_or_Rural_Area'], as_index=False).size().sort_values(ascending=False).reset_index(name='count')
urban = tod['count'][0] / (tod['count'][0] + tod['count'][1])
rural = tod['count'][1] / (tod['count'][0] + tod['count'][1])
print('Urban: ' + str(urban))
print('Rural: ' + str(rural))
The first thing we need to establish is the split of accidents between urban and rural areas. This is important because we want to identify any class imbalance: if one class has many more instances than the other, standard classifiers tend to be overwhelmed by the larger class and ignore the smaller one. Here we see that about 2/3 of all accidents happen in urban areas. I will first try to predict with the classes as they are, and will amend any problems that occur later.
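If the imbalance did need amending, one common alternative to over/undersampling is class weighting. A sketch of the "balanced" heuristic used by scikit-learn's class_weight option (weight = n_samples / (n_classes × n_c)), applied to the approximate 65/35 split above:

```python
def balanced_weights(counts):
    # sklearn-style 'balanced' class weights: n_samples / (n_classes * n_c),
    # so the rarer class gets the larger weight
    n = sum(counts.values())
    k = len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

# Approximate split from the plot above: ~65% urban (0), ~35% rural (1)
weights = balanced_weights({0: 65, 1: 35})
print(weights)
```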
X = acci.drop(['Accident_Index',
'Location_Easting_OSGR',
'Location_Northing_OSGR',
'Longitude',
'Latitude',
'Police_Force',
'Accident_Severity',
'Number_of_Vehicles',
'Number_of_Casualties',
'Time',
'Junction_Control',
'Local_Authority_(District)',
'Local_Authority_(Highway)',
'1st_Road_Number',
'2nd_Road_Number',
'Urban_or_Rural_Area',
'Did_Police_Officer_Attend_Scene_of_Accident',
'timestamp',
'label',
'Date'
], axis=1)
X.head()
We need to build a dataset of independent variables that can help classify our output. These would be parameters that plausibly contribute to accidents: they are not the reasons why the accidents occurred, but they are the challenges and stressors placed on drivers that can ultimately lead to an accident. I have created the above table with such stressors. Also included are day of week and hour of day, as I have shown that time is an important factor in traffic volume, and more traffic implies a larger stress factor.
plt.figure(figsize = (16,5))
sns.heatmap(X.corr(),cmap='coolwarm',annot=True)
This correlation plot is intended to show whether any parameters exhibit collinear behaviour. Generalised linear and linear regression models do not work well if the predicting features are similar (i.e. collinear), as they have a hard time assigning coefficient values to multicollinear parameters, which can wreak havoc in general. Only one relationship shows some form of possible collinearity: 2nd_Road_Class and Junction_Detail (0.69). We can investigate this further by consulting the variance inflation factor (VIF) of these two parameters.
# VIFs are the diagonal of the inverse of the correlation matrix
ck = np.column_stack([acci['2nd_Road_Class'], acci['Junction_Detail']])
cc = np.corrcoef(ck, rowvar=False)
VIF = np.linalg.inv(cc)
VIF.diagonal()
We would be worried about collinearity if the VIF scores were above 5. Here, the VIF is ~2, which is low enough for us not to be too concerned about a possible collinearity issue.
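The inverse-correlation-matrix trick above generalises to any number of columns; with just two columns, the VIF reduces to 1/(1 − r²). A small synthetic check (hypothetical data, not the accident columns) that a nearly collinear pair produces a VIF well above the threshold of 5:

```python
import numpy as np

x1 = np.arange(8, dtype=float)
x2 = x1 + np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1])  # nearly collinear with x1

cc = np.corrcoef(np.column_stack([x1, x2]), rowvar=False)
vif = np.linalg.inv(cc).diagonal()  # with two columns, both entries are 1/(1 - r^2)
print(vif)
```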
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_scaled = pd.DataFrame(scaler.transform(X),columns = X.columns)
X_train, X_test, y_train, y_test = train_test_split(X_scaled,acci['Urban_or_Rural_Area'], test_size=0.30)
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
pred_yt = logmodel.predict(X_train)
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
print('Accuracy test: ' + str(accuracy_score(y_test, predictions)))
print('Accuracy train: ' + str(accuracy_score(y_train,pred_yt)))
print(classification_report(y_test,predictions))
confusion_matrix(y_test,predictions)
Confusion matrix:
TP = 328047 (correctly classified as Urban)
TN = 129940 (correctly classified as Rural)
FP = 15607 (classified as Urban when Rural)
FN = 60516 (classified as Rural when Urban)
Above are the results of the LR model. First off, the proportions classified by the model are TP/N = 61% Urban, hence 39% Rural, in comparison to 65% Urban and 35% Rural in the data itself.
This overall result shows that even though there is a class imbalance in our predicted variable, the LR model still does a pretty good job of classifying the data to about the same ratio. This implies we would not need to over/undersample our training data. The misclassification (FP and FN) rates are also divided similarly.
The overall accuracy ((TN + TP) / N) for both the training and test sets is about 85%. Accuracy does not always provide a useful metric for classification, especially if there is a large class imbalance; however, we have shown that the model classifies to roughly the ratio in the data. Essentially, our model predicts the true classification about 8.5 times out of 10. Another important takeaway is that both the training and test data fit the model similarly well, so there is little reason to worry about overfitting.
More useful metrics are precision (TP / (TP + FP)) and recall (TP/ (TP + FN)).
Precision:
When the model predicts yes, how often is it correct? The precision for each class is fairly similar (84% and 89%). Precision is a good measure when the cost of a False Positive is high.
Recall:
When the actual result is yes, how often does the model catch it? The recall differs greatly between the classes: close to 100% for Urban classification but only about 70% for Rural. Basically, the model had a harder time predicting true Rural values. Recall should be the metric used to select the best model when there is a high cost associated with a False Negative.
F1 Score:
The F1 score (2 × (recall × precision) / (recall + precision)) is needed when you want to balance precision and recall. So what is the difference between the F1 score and accuracy? We have seen that accuracy can be inflated by a large number of TNs, which in most business circumstances we do not focus on much, whereas False Negatives and False Positives usually carry business costs (tangible and intangible). The F1 score is therefore a better measure when we need a balance between precision and recall and there is an uneven class distribution (a large number of actual negatives). Here our F1 scores are 0.90 and 0.77 for Urban and Rural roads, respectively, with an overall F1 score of 0.85. The higher the F1 score, the better.
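The three formulas above can be collected into a small helper (a sketch for illustration, not part of the original analysis):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (recall * precision) / (recall + precision)
    return precision, recall, f1

# Toy counts: 8 true positives, 2 false positives, 2 false negatives.
p, r, f = precision_recall_f1(8, 2, 2)
print(round(p, 3), round(r, 3), round(f, 3))
```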
feature_importance = abs(logmodel.coef_[0])
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
# Set the figure size here so the barh plot is drawn on this figure,
# rather than opening a second, blank figure afterwards.
featfig = plt.figure(figsize=(16, 5))
featax = featfig.add_subplot(1, 1, 1)
featax.barh(pos, feature_importance[sorted_idx], align='center')
featax.set_yticks(pos)
featax.set_yticklabels(np.array(X.columns)[sorted_idx], fontsize=8)
featax.set_xlabel('Relative Feature Importance %')
plt.tight_layout()
plt.show()
This plot shows the relative importance of each feature, with each coefficient normalized to the maximum absolute coefficient value. We see that the speed limit parameter is by far the most important feature, while pedestrian crossing physical facilities and road type are second and third. This makes conceptual sense given what we know and hear about road accidents in the news and elsewhere.
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

score_test = []
score_train = []
tmp = pd.DataFrame(index=X_train.columns.values)
for i in range(10):
    # Vary the seed each run; with a fixed random_state all ten
    # iterations would grow an identical forest.
    classifier = RandomForestClassifier(random_state=i, n_estimators=100, max_depth=5,
                                        max_features=13, min_samples_leaf=3,
                                        min_samples_split=2, bootstrap=True, criterion='gini')
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    score_test.append(classifier.score(X_test, y_test))
    score_train.append(classifier.score(X_train, y_train))
    feature_import = pd.DataFrame(data=classifier.feature_importances_,
                                  index=X_train.columns.values, columns=[i])
    tmp = pd.concat([tmp, feature_import], axis=1, sort=False)
print('train: ' + str(np.mean(score_train)))
print('test: ' + str(np.mean(score_test)))
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))
Here, I perform a random forest classification to compare with the LR model. The random forest compares well with LR, if not performing a little better: there is more consistency between the recall values and the F1 scores, while still yielding favourable overall precision and recall.
feature_importances = pd.DataFrame(100 * classifier.feature_importances_,
                                   index=X_train.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
feature_importances
Feature importances are also provided by the random forest. They agree well with the LR model, although road type is second in this list and pedestrian crossing physical facilities is bumped to third.
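Since the loop above collects one column of importances per repeat in `tmp`, averaging across the columns gives a more stable ranking than any single fit. A minimal sketch with a hypothetical two-run DataFrame standing in for `tmp`:

```python
import pandas as pd

# Hypothetical stand-in for `tmp`: one column of feature importances per run.
tmp_demo = pd.DataFrame({0: [0.50, 0.30, 0.20],
                         1: [0.40, 0.40, 0.20]},
                        index=['Speed_limit', 'Road_Type', 'Ped_Crossing'])

# Average across runs and express as percentages, largest first.
mean_importance = (100 * tmp_demo.mean(axis=1)).sort_values(ascending=False)
print(mean_importance)
```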
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
# Score on predicted probabilities rather than hard class labels so the
# AUC matches the plotted ROC curves.
logit_roc_auc = roc_auc_score(y_test, logmodel.predict_proba(X_test)[:, 1])
logit_roc_auc1 = roc_auc_score(y_test, classifier.predict_proba(X_test)[:, 1])
fpr, tpr, thresholds = roc_curve(y_test, logmodel.predict_proba(X_test)[:,1])
fpr1, tpr1, thresholds1 = roc_curve(y_test, classifier.predict_proba(X_test)[:,1])
plt.figure(figsize = (16,5))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot(fpr1, tpr1, label='Random Forest (area = %0.2f)' % logit_roc_auc1)
plt.plot([0, 1], [0, 1],'g--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
The Receiver Operating Characteristic (ROC) is an important tool in the evaluation of a model's results. The ROC is a probability curve and the AUC (Area Under the Curve) represents the degree of separability: it tells us how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. The dashed line represents a random classifier, and the blue and orange curves are the LR and RF models, respectively. We want the model curves to be as far from the random-classifier line as possible; ideally, both curves would sit as high and as far to the left as possible, indicating an excellent ratio of TPs to FPs. The AUC falls between 0 and 1, with a higher number meaning a better classification rate; a totally random model has an AUC of 0.5. We see here that the RF model performs slightly better than the LR model.
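The 0.5 baseline is easy to verify empirically: score random guesses against random labels and the AUC lands near 0.5 (a sketch, not part of the notebook's pipeline):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 10000)   # random binary ground truth
scores = rng.random(10000)      # a classifier that guesses at random
auc = roc_auc_score(y, scores)
print(round(auc, 2))            # hovers around 0.5
```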
In this quick project, I have examined British accident data to find some interesting behaviour and relationships, tried to model and predict future accident outcomes, and determined some of the most important features related to these accidents. There is much more to be discovered in this data, but for now, this is a good start!
Perhaps the most important finding is that there were significantly fewer accidents in 2015 than in 2005; accidents remain a major concern, however, and the most important factor behind them is speed.